BatFrame Tutorial

Interactions with MLDB occurs via a REST API. Interacting with a REST API over HTTP from a Notebook interface can be a little bit laborious if you're using a general-purpose Python library like requests directly, so MLDB comes with a Python library called pymldb to ease the pain.

pymldb does this in three ways:

  • the Python BatFrame class: this is a class that behaves like the Pandas DataFrame but offloads computation to the server via HTTP calls. This tutorial shows you how to use it.
  • the Python Resource class: this is simple class which wraps the requests library so as to make HTTP calls to the MLDB API more friendly in a Notebook environment. Check out the Resource Wrapper Tutorial for more info on the Resource class.
  • the %mldb magics: these are Jupyter line- and cell-magic commands which allow you to make raw HTTP calls to MLDB, and also provides some higher-level functions. Check out the Cell magic Tutorial for more info on the %mldb magic system.

BatFrame intro

The pymldb library includes a system similar to the pandas DataFrame called the BatFrame. If you are familiar with Pandas' DataFrame then you will feel right at home. The BatFrame allows you to interact with the database in a familiar syntax. Every dataset can be conceptualised as a matrix of values like any SQL tables, but with the addition of a time component. For now we will leave the time component outside of the BatFrame exploration and focus on the well know row/column format of a table.

This tutorial will walk you through the various features of the BatFrame.

Creating a dataset by magic

We'll use the %mldb magic commands to interact with the REST API to delete the dataset if it's already been loaded, and then run a remote Python script to load the data. See the Cell Magic Tutorial for more details or execute %mldb help after running %reload_ext pymldb.


In [1]:
%reload_ext pymldb
%mldb DELETE /v1/datasets/superheroes_and_villains
%mldb py http://opensource.datacratic.com/mldb-demo-resources/import_superheroes_and_villains.py


mldb magic initialized with host as http://localhost

Initializing the BatFrame


In [2]:
from pymldb.data import BatFrame

The BatFrame is instantiated with the address of a dataset. Here we instantiate one on the dataset that was created above. Pretty straight forward.


In [3]:
bf = BatFrame("http://localhost/v1/datasets/superheroes_and_villains")

In [4]:
type(bf)


Out[4]:
mldb.data.BatFrame

In [5]:
bf


               Bad   Eyes  Gender  Good   Hair  Height   Skin  Weight
rowName                                                              
Superman         0   Blue    Male     1  Black      75    NaN     235
Power Girl       0   Blue  Female     1  Blond      71    NaN     180
Supergirl        0   Blue  Female     1  Blond      65    NaN     135
Batwoman         0   Blue  Female     1  Black      66    NaN     123
Wonder Woman     0   Blue  Female     1  Black      72    NaN     130
Catwoman         1  Green  Female     1  Black      67    NaN     133
The Joker        1  Green    Male     0  Green      72  White     160
Harley Quinn     1   Blue  Female     0  Blond      67    NaN     140
Nightwing        0   Blue    Male     1  Black      70    NaN     175
Oracle           0   Blue  Female     1    Red      67    NaN     126
Batman           0   Blue    Male     1  Black      74    NaN     210
The Flash        0   Blue    Male     1  Blond      72    NaN     195
Aquaman          0   Blue    Male     1  Blond      73    NaN     325
Zatanna          0   Blue  Female     1  Black      67    NaN     127
Green Lantern    0  Brown    Male     1  Brown      72    NaN     200
Black Canary     0   Blue  Female     1  Blond      67    NaN     130
Out[5]:

Column access

We can first get a list of the available columns with


In [6]:
bf.columns


Out[6]:
[u'Weight', u'Bad', u'Hair', u'Good', u'Skin', u'Eyes', u'Height', u'Gender']

Just like you would expect, you use the square brackets with the name of the column.


In [7]:
bf["Skin"]


Superman          None
Power Girl        None
Supergirl         None
Batwoman          None
Wonder Woman      None
Catwoman          None
The Joker        White
Harley Quinn      None
Nightwing         None
Oracle            None
Batman            None
The Flash         None
Aquaman           None
Zatanna           None
Green Lantern     None
Black Canary      None
dtype: object
Out[7]:

Selecting one column will return a Column object that can later be used in other calculations.


In [8]:
height = bf["Height"]
type(height)


Out[8]:
mldb.data.Column

Or a list of columns. The columns don't need to be consecutive. It is totally independent of the order the might have been inserted in the dataset. So imagine for example that you wanted to print the values of a few columns, you can do so in whatever order you want.


In [9]:
bf[["Gender", "Height", "Eyes"]]


                Eyes  Gender  Height
rowName                             
Superman        Blue    Male      75
Power Girl      Blue  Female      71
Supergirl       Blue  Female      65
Batwoman        Blue  Female      66
Wonder Woman    Blue  Female      72
Catwoman       Green  Female      67
The Joker      Green    Male      72
Harley Quinn    Blue  Female      67
Nightwing       Blue    Male      70
Oracle          Blue  Female      67
Batman          Blue    Male      74
The Flash       Blue    Male      72
Aquaman         Blue    Male      73
Zatanna         Blue  Female      67
Green Lantern  Brown    Male      72
Black Canary    Blue  Female      67
Out[9]:

Row access

Similar to the column slicing and indexing, you can view rows via the ix property of the batframe


In [10]:
bf.ix["Batman"]


         Bad  Eyes Gender  Good   Hair  Height  Weight
rowName                                               
Batman     0  Blue   Male     1  Black      74     210
Out[10]:


In [11]:
bf.ix[["Batman", "Superman"]]


          Bad  Eyes Gender  Good   Hair  Height  Weight
rowName                                                
Superman    0  Blue   Male     1  Black      75     235
Batman      0  Blue   Male     1  Black      74     210
Out[11]:

You can also select portions of you data by slicing the rows and column via the ix property.


In [12]:
bf.ix["Batman", "Height"]


Batman    74
dtype: int64
Out[12]:

Binary arithmetics

If your columns contain integers or float, you can easily do binary arithmetics on them. For example, the Height column is in inches. Let's say you wanted it in meters instead

You can do any of these:

  • Addition (+)
  • Substraction (-)
  • Multiplication (*)
  • Division (/)
  • Modulo (%)
  • Power (**)

In [13]:
height = (bf["Height"] * 2.54)/100
height


Superman         1.9050
Power Girl       1.8034
Supergirl        1.6510
Batwoman         1.6764
Wonder Woman     1.8288
Catwoman         1.7018
The Joker        1.8288
Harley Quinn     1.7018
Nightwing        1.7780
Oracle           1.7018
Batman           1.8796
The Flash        1.8288
Aquaman          1.8542
Zatanna          1.7018
Green Lantern    1.8288
Black Canary     1.7018
dtype: float64
Out[13]:

Of if you wanted the Weight in kilograms instead of pounds


In [14]:
weight = bf["Weight"] / 2.2
weight


Superman         106.818182
Power Girl        81.818182
Supergirl         61.363636
Batwoman          55.909091
Wonder Woman      59.090909
Catwoman          60.454545
The Joker         72.727273
Harley Quinn      63.636364
Nightwing         79.545455
Oracle            57.272727
Batman            95.454545
The Flash         88.636364
Aquaman          147.727273
Zatanna           57.727273
Green Lantern     90.909091
Black Canary      59.090909
dtype: float64
Out[14]:

For example, let's calculate the BMI (Body Mass Index) of every superheroes and villains at our disposal


In [15]:
bmi = weight / (height**2)
bmi


Superman         29.434402
Power Girl       25.157396
Supergirl        22.512148
Batwoman         19.894230
Wonder Woman     17.668035
Catwoman         20.874300
The Joker        21.745274
Harley Quinn     21.972947
Nightwing        25.162388
Oracle           19.775653
Batman           27.018780
The Flash        26.502052
Aquaman          42.968236
Zatanna          19.932602
Green Lantern    27.181592
Black Canary     20.403451
dtype: float64
Out[15]:

And as you would expect, you cannot divide by zero


In [16]:
import traceback
try:
    bf["Weight"] / 0
except ValueError:
    print traceback.format_exc()


Traceback (most recent call last):
  File "<ipython-input-16-fa3327c56013>", line 3, in <module>
    bf["Weight"] / 0
  File "/usr/local/lib/python2.7/dist-packages/mldb/data.py", line 323, in __div__
    "Cannot divide by zero. "
ValueError: Cannot divide by zero. Do you really want to explode the planet?

Unary arithmetics

Negate a column


In [17]:
-bf["Height"]


Superman        -75
Power Girl      -71
Supergirl       -65
Batwoman        -66
Wonder Woman    -72
Catwoman        -67
The Joker       -72
Harley Quinn    -67
Nightwing       -70
Oracle          -67
Batman          -74
The Flash       -72
Aquaman         -73
Zatanna         -67
Green Lantern   -72
Black Canary    -67
dtype: int64
Out[17]:

Filtering

Rich Comparison

Using the comparison operators on a column will return a Query object that is used by either a column or a BatFrame to filter out the values

You can do any of these:

  • Greater than (>)
  • Greater than or equal (>=)
  • Lesser than (<)
  • Lesser than or equal (<=)
  • Equal (==)
  • Not equal(!=)

In [18]:
type(bf["Height"] > 70)


Out[18]:
mldb.query.Query

In [19]:
bf[bf["Height"] > 70]


               Bad   Eyes  Gender  Good   Hair  Height   Skin  Weight
rowName                                                              
Superman         0   Blue    Male     1  Black      75    NaN     235
Power Girl       0   Blue  Female     1  Blond      71    NaN     180
Wonder Woman     0   Blue  Female     1  Black      72    NaN     130
The Joker        1  Green    Male     0  Green      72  White     160
Batman           0   Blue    Male     1  Black      74    NaN     210
The Flash        0   Blue    Male     1  Blond      72    NaN     195
Aquaman          0   Blue    Male     1  Blond      73    NaN     325
Green Lantern    0  Brown    Male     1  Brown      72    NaN     200
Out[19]:

Column containing boolean

Here we have the columns Good and Bad which contains boolean values. You can filter your BatFrame based on that as well


In [20]:
bf[bf["Good"]]


               Bad   Eyes  Gender  Good   Hair  Height  Weight
rowName                                                       
Superman         0   Blue    Male     1  Black      75     235
Power Girl       0   Blue  Female     1  Blond      71     180
Supergirl        0   Blue  Female     1  Blond      65     135
Batwoman         0   Blue  Female     1  Black      66     123
Wonder Woman     0   Blue  Female     1  Black      72     130
Catwoman         1  Green  Female     1  Black      67     133
Nightwing        0   Blue    Male     1  Black      70     175
Oracle           0   Blue  Female     1    Red      67     126
Batman           0   Blue    Male     1  Black      74     210
The Flash        0   Blue    Male     1  Blond      72     195
Aquaman          0   Blue    Male     1  Blond      73     325
Zatanna          0   Blue  Female     1  Black      67     127
Green Lantern    0  Brown    Male     1  Brown      72     200
Black Canary     0   Blue  Female     1  Blond      67     130
Out[20]:

You can even combine boolean columns if you wish


In [21]:
bf[bf["Good"] & bf["Bad"]]


          Bad   Eyes  Gender  Good   Hair  Height  Weight
rowName                                                  
Catwoman    1  Green  Female     1  Black      67     133
Out[21]:

And we see that only Catwoman is both good and bad because she is at times a thief and at other times romantically involved with Batman and helping him.

Or you can do things a bit more complicated like who are the Superheroes (Good and not Bad) that are at least 73 inches or shorter than 70.


In [22]:
bf[(bf["Good"] & ~bf["Bad"]) & ((bf["Height"] >= 73) | (bf["Height"] < 70))]


              Bad  Eyes  Gender  Good   Hair  Height  Weight
rowName                                                     
Superman        0  Blue    Male     1  Black      75     235
Supergirl       0  Blue  Female     1  Blond      65     135
Batwoman        0  Blue  Female     1  Black      66     123
Oracle          0  Blue  Female     1    Red      67     126
Batman          0  Blue    Male     1  Black      74     210
Aquaman         0  Blue    Male     1  Blond      73     325
Zatanna         0  Blue  Female     1  Black      67     127
Black Canary    0  Blue  Female     1  Blond      67     130
Out[22]:

Basic functionnalities

Getting the min/max value of a column


In [23]:
bf["Height"].min()


Out[23]:
65

In [24]:
bf["Weight"].min()


Out[24]:
123

Printing only the head of the dataset because it can be too big to fit in memory. This works on both BatFrame and Column objects


In [25]:
bf.head()


              Bad  Eyes  Gender  Good   Hair  Height  Weight
rowName                                                     
Superman        0  Blue    Male     1  Black      75     235
Power Girl      0  Blue  Female     1  Blond      71     180
Supergirl       0  Blue  Female     1  Blond      65     135
Batwoman        0  Blue  Female     1  Black      66     123
Wonder Woman    0  Blue  Female     1  Black      72     130
Out[25]:


In [26]:
bf["Eyes"].head()


Out[26]:
Superman        Blue
Power Girl      Blue
Supergirl       Blue
Batwoman        Blue
Wonder Woman    Blue
dtype: object

Sometimes you just want the unique values that a column contains to see, for example, the range of values


In [27]:
bf["Eyes"].unique()


Out[27]:
[u'Blue', u'Brown', u'Green']

Sorting

Both the BatFrame and Column can be sorted. Let's start with the simplest one, Column.


In [28]:
# sort takes an optional argument ascending. By default it's True. 
# Puting it at False would return the values in descending order
bf["Height"].sort()


Supergirl        65
Batwoman         66
Catwoman         67
Harley Quinn     67
Oracle           67
Zatanna          67
Black Canary     67
Nightwing        70
Power Girl       71
Wonder Woman     72
The Joker        72
The Flash        72
Green Lantern    72
Aquaman          73
Batman           74
Superman         75
dtype: int64
Out[28]:

The BatFrame offers a bit more versatilty. You can sort on one particular column


In [29]:
bf.sort("Height")


               Bad   Eyes  Gender  Good   Hair  Height   Skin  Weight
rowName                                                              
Supergirl        0   Blue  Female     1  Blond      65    NaN     135
Batwoman         0   Blue  Female     1  Black      66    NaN     123
Catwoman         1  Green  Female     1  Black      67    NaN     133
Harley Quinn     1   Blue  Female     0  Blond      67    NaN     140
Oracle           0   Blue  Female     1    Red      67    NaN     126
Zatanna          0   Blue  Female     1  Black      67    NaN     127
Black Canary     0   Blue  Female     1  Blond      67    NaN     130
Nightwing        0   Blue    Male     1  Black      70    NaN     175
Power Girl       0   Blue  Female     1  Blond      71    NaN     180
Wonder Woman     0   Blue  Female     1  Black      72    NaN     130
The Joker        1  Green    Male     0  Green      72  White     160
The Flash        0   Blue    Male     1  Blond      72    NaN     195
Green Lantern    0  Brown    Male     1  Brown      72    NaN     200
Aquaman          0   Blue    Male     1  Blond      73    NaN     325
Batman           0   Blue    Male     1  Black      74    NaN     210
Superman         0   Blue    Male     1  Black      75    NaN     235
Out[29]:

Or a list of columns. The batframe will be ordered based on the order of the list


In [30]:
bf.sort(["Height", "Eyes"])


               Bad   Eyes  Gender  Good   Hair  Height   Skin  Weight
rowName                                                              
Supergirl        0   Blue  Female     1  Blond      65    NaN     135
Batwoman         0   Blue  Female     1  Black      66    NaN     123
Harley Quinn     1   Blue  Female     0  Blond      67    NaN     140
Oracle           0   Blue  Female     1    Red      67    NaN     126
Zatanna          0   Blue  Female     1  Black      67    NaN     127
Black Canary     0   Blue  Female     1  Blond      67    NaN     130
Catwoman         1  Green  Female     1  Black      67    NaN     133
Nightwing        0   Blue    Male     1  Black      70    NaN     175
Power Girl       0   Blue  Female     1  Blond      71    NaN     180
Wonder Woman     0   Blue  Female     1  Black      72    NaN     130
The Flash        0   Blue    Male     1  Blond      72    NaN     195
Green Lantern    0  Brown    Male     1  Brown      72    NaN     200
The Joker        1  Green    Male     0  Green      72  White     160
Aquaman          0   Blue    Male     1  Blond      73    NaN     325
Batman           0   Blue    Male     1  Black      74    NaN     210
Superman         0   Blue    Male     1  Black      75    NaN     235
Out[30]:

You can ask for it to be sorted in reverse order (aka descending order)


In [31]:
bf.sort(["Height", "Weight"], ascending=False)


               Bad   Eyes  Gender  Good   Hair  Height   Skin  Weight
rowName                                                              
Superman         0   Blue    Male     1  Black      75    NaN     235
Batman           0   Blue    Male     1  Black      74    NaN     210
Aquaman          0   Blue    Male     1  Blond      73    NaN     325
Green Lantern    0  Brown    Male     1  Brown      72    NaN     200
The Flash        0   Blue    Male     1  Blond      72    NaN     195
The Joker        1  Green    Male     0  Green      72  White     160
Wonder Woman     0   Blue  Female     1  Black      72    NaN     130
Power Girl       0   Blue  Female     1  Blond      71    NaN     180
Nightwing        0   Blue    Male     1  Black      70    NaN     175
Harley Quinn     1   Blue  Female     0  Blond      67    NaN     140
Catwoman         1  Green  Female     1  Black      67    NaN     133
Black Canary     0   Blue  Female     1  Blond      67    NaN     130
Zatanna          0   Blue  Female     1  Black      67    NaN     127
Oracle           0   Blue  Female     1    Red      67    NaN     126
Batwoman         0   Blue  Female     1  Black      66    NaN     123
Supergirl        0   Blue  Female     1  Blond      65    NaN     135
Out[31]:

Let's say you wanted to sort your data in different order for each column, then just provide a list of booleans to the parameter ascending and it will do just that


In [32]:
bf.sort(["Height", "Weight"], ascending=[False, True])


               Bad   Eyes  Gender  Good   Hair  Height   Skin  Weight
rowName                                                              
Superman         0   Blue    Male     1  Black      75    NaN     235
Batman           0   Blue    Male     1  Black      74    NaN     210
Aquaman          0   Blue    Male     1  Blond      73    NaN     325
Wonder Woman     0   Blue  Female     1  Black      72    NaN     130
The Joker        1  Green    Male     0  Green      72  White     160
The Flash        0   Blue    Male     1  Blond      72    NaN     195
Green Lantern    0  Brown    Male     1  Brown      72    NaN     200
Power Girl       0   Blue  Female     1  Blond      71    NaN     180
Nightwing        0   Blue    Male     1  Black      70    NaN     175
Oracle           0   Blue  Female     1    Red      67    NaN     126
Zatanna          0   Blue  Female     1  Black      67    NaN     127
Black Canary     0   Blue  Female     1  Blond      67    NaN     130
Catwoman         1  Green  Female     1  Black      67    NaN     133
Harley Quinn     1   Blue  Female     0  Blond      67    NaN     140
Batwoman         0   Blue  Female     1  Black      66    NaN     123
Supergirl        0   Blue  Female     1  Blond      65    NaN     135
Out[32]:

The characters are firstly order (ascending=False) on Height. Then, among those who are 67 inches high (for example), they are ordered from lightest to heaviest (ascending=True)

Fun with plots


In [33]:
%pylab inline


Populating the interactive namespace from numpy and matplotlib

In [34]:
from bokeh.plotting import *
from bokeh.charts import Histogram
from bokeh.models import HoverTool


BokehJS successfully loaded.

In [35]:
output_notebook()

In [36]:
from pylab import get_cmap
def color_picker(number, cmap='Paired'):    
    cm = get_cmap(cmap)
    colors = []
    for i in range(number):
        rgba = cm(1.*i/number)
        color = "#"
        for hue in rgba[:3]:
            f = format(int(hue*255),'x')
            if len(f) == 1:
                f = '0' + f
            color += f
        colors.append(color)
    return colors

Calling the values attribute on the columns will return a numpy array of the values only.


In [37]:
height = (bf["Height"] * 2.54)/100
height_val = height.values

weight = bf["Weight"]/2.2
weight_val = weight.values

bmi = weight/height**2
bmi_val = bmi.values

In [38]:
p = figure(plot_width=800, plot_height=800, title="Justice League", background_fill="#E5E5E5", 
           tools="hover")

source = ColumnDataSource(
    data=dict(
        height = height_val,
        weight = weight_val,
        hero=bf.rows,
        bmi = bmi_val
        )
    )    
male, female = color_picker(2)
colors = [male if gender=="Male" else female for gender in bf["Gender"]]
p.circle('height', 'weight', source=source, radius=5, radius_units="screen", color=colors)
hover = p.select(dict(type=HoverTool))
hover.tooltips = [
    ('Hero', '@hero'),
    ('Height', '@height'),
    ('Weight', '@weight'),
    ('BMI', '@bmi')
]

p.xaxis.axis_label="Height (Meter)"
p.yaxis.axis_label="Weight (Kg)"
show(p)


Pandas

Because we also love and respect Pandas, we also provide a way to export the BatFrame to a DataFrame


In [39]:
df = bf.toPandas()

In [40]:
type(df)


Out[40]:
pandas.core.frame.DataFrame

Putting it all together

Now that you've seen the basics, check out the Analyzing the Iris Flower Dataset demo to see how to use the BatFrame with MLDB.


In [ ]: